Collaborating Authors: Merced


End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen, Ming-Hsuan Yang, University of California, Merced

Neural Information Processing Systems

In this supplementary document, we provide additional analysis and experimental results, including 1) more implementation details, 2) results of single-stream and two-stream models, and 3) more qualitative results for text-guided video temporal grounding. The co-attentional transformer layers in our model have hidden states of size 1024 and 8 attention heads. For the intra-modality contrastive learning, we randomly sample 3 positive videos that contain the same action as the anchor video and 4 negative videos with different action categories. Our framework is implemented on a machine with an Intel Xeon 2.3 GHz processor and an NVIDIA GTX 1080 Ti GPU with 11 GB memory. In Table 1, we present the results of single-stream models using optical flow or depth as visual input, and two-stream models using depth and RGB or depth and optical flow as the two modalities.
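As a concrete illustration of the sampling scheme above, here is a minimal Python sketch: for each anchor video, 3 positives share its action label and 4 negatives do not. The function name and label encoding are ours for illustration, not from the authors' code.

```python
import random
from collections import defaultdict

def sample_contrastive_batch(labels, anchor_idx, n_pos=3, n_neg=4, rng=random):
    """Sample positives (same action as anchor) and negatives (different action).

    `labels` maps video index -> action category; names are illustrative.
    """
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    anchor_lab = labels[anchor_idx]
    pos_pool = [i for i in by_label[anchor_lab] if i != anchor_idx]
    neg_pool = [i for i in range(len(labels)) if labels[i] != anchor_lab]
    positives = rng.sample(pos_pool, min(n_pos, len(pos_pool)))
    negatives = rng.sample(neg_pool, min(n_neg, len(neg_pool)))
    return positives, negatives

# Toy example: 10 videos spanning 3 action categories.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
positives, negatives = sample_contrastive_batch(labels, anchor_idx=0)
```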


End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen, Ming-Hsuan Yang, University of California, Merced

Neural Information Processing Systems

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Unlike most existing methods that use only RGB images as visual input, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues for certain events, the performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion, and depth maps to infer the scene configuration when the action is related to objects recognizable by their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
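To make the fusion idea concrete, below is a minimal co-attentional layer sketch in PyTorch, where one modality's tokens attend to another's. It assumes the hidden size (1024) and head count (8) reported in the supplementary, but is otherwise our own illustrative stand-in, not the released model.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """One cross-modal attention block: queries from one modality,
    keys/values from another (hidden size 1024, 8 heads per the paper)."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, context):
        attended, _ = self.attn(x, context, context)  # x attends to context
        x = self.norm1(x + attended)
        return self.norm2(x + self.ffn(x))

rgb = torch.randn(2, 16, 1024)    # (batch, frames, features)
flow = torch.randn(2, 16, 1024)
layer = CoAttentionLayer()
rgb_enriched = layer(rgb, flow)   # RGB stream enriched by optical flow
```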


Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

arXiv.org Artificial Intelligence

Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.
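The exact 2SED update rule is specified in the paper; the following hedged Python sketch only illustrates the general shape of a fractional-order SGD step whose exponent is adapted online. The displacement-based scaling and the dimension proxy used here are illustrative assumptions, not the authors' estimator.

```python
import math
import numpy as np

def fosgd_step(w, w_prev, grad, lr, alpha):
    """One fractional-style step: scale the gradient by a fractional power
    of the last parameter displacement (illustrative, not the paper's rule)."""
    disp = np.abs(w - w_prev) + 1e-8              # avoid 0 ** negative
    frac = disp ** (alpha - 1.0) / math.gamma(alpha)
    frac = np.minimum(frac, 5.0)                  # clip for toy stability
    return w - lr * frac * grad

# Toy quadratic objective f(w) = 0.5 * ||w||^2, so grad = w.
w_prev, w = np.ones(5), 0.9 * np.ones(5)
for _ in range(100):
    grad = w
    # Placeholder "dimension-aware" adaptation: a crude sensitivity proxy
    # (gradient norm) nudges alpha within [0.5, 1]; 2SED's estimator differs.
    alpha = min(1.0, max(0.5, 1.0 / (1.0 + np.linalg.norm(grad))))
    w, w_prev = fosgd_step(w, w_prev, grad, lr=0.1, alpha=alpha), w
print("final ||w||:", np.linalg.norm(w))
```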


GWRF: A Generalizable Wireless Radiance Field for Wireless Signal Propagation Modeling

arXiv.org Artificial Intelligence

We present Generalizable Wireless Radiance Fields (GWRF), a framework for modeling wireless signal propagation at arbitrary 3D transmitter and receiver positions. Unlike previous methods that adapt vanilla Neural Radiance Fields (NeRF) from the optical to the wireless signal domain, requiring extensive per-scene training, GWRF generalizes effectively across scenes. First, a geometry-aware Transformer encoder-based wireless scene representation module incorporates information from geographically proximate transmitters to learn a generalizable wireless radiance field. Second, a neural-driven ray tracing algorithm operates on this field to automatically compute signal reception at the receiver. Experimental results demonstrate that GWRF outperforms existing methods on single scenes and achieves state-of-the-art performance on unseen scenes.
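The neural-driven ray tracing can be pictured as NeRF-style accumulation along the transmitter-to-receiver segment. The sketch below shows that analogy only; GWRF's actual tracer and field parameterization differ, and `field` is a stand-in callable.

```python
import numpy as np

def render_signal(field, tx, rx, n_samples=64):
    """Accumulate a wireless 'radiance' field along the segment tx -> rx,
    in the style of NeRF volume rendering (illustrative only)."""
    t = np.linspace(0.0, 1.0, n_samples)
    points = tx[None, :] + t[:, None] * (rx - tx)[None, :]
    sigma, signal = field(points)        # attenuation density, emitted signal
    delta = np.linalg.norm(rx - tx) / n_samples
    trans = np.exp(-np.cumsum(sigma * delta))     # transmittance along ray
    weights = trans * (1.0 - np.exp(-sigma * delta))
    return np.sum(weights * signal)

# Stand-in field: free space with a lossy slab between x = 0.4 and x = 0.6.
def field(points):
    sigma = np.where((points[:, 0] > 0.4) & (points[:, 0] < 0.6), 5.0, 0.01)
    return sigma, np.ones(len(points))

rss = render_signal(field, np.zeros(3), np.array([1.0, 0.0, 0.0]))
print("received signal proxy:", rss)
```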


RoMu4o: A Robotic Manipulation Unit For Orchard Operations Automating Proximal Hyperspectral Leaf Sensing

arXiv.org Artificial Intelligence

Driven by the need to address labor shortages and meet the demands of a rapidly growing population, robotic automation has become a critical component in precision agriculture. Leaf-level hyperspectral spectroscopy has been shown to be a powerful tool for phenotyping, monitoring crop health, and identifying essential nutrients within plants, as well as for detecting diseases and water stress. This work introduces RoMu4o, a robotic manipulation unit for orchard operations that offers an automated solution for proximal hyperspectral leaf sensing. The ground robot is equipped with a 6-DOF robotic arm and a vision system for real-time deep learning-based image processing and motion planning. We developed robust perception and manipulation pipelines that enable the robot to grasp target leaves and perform spectroscopy. These frameworks operate synergistically to identify and extract the 3D structure of leaves from an observed batch of foliage, propose 6D poses, and generate collision-free, constraint-aware paths for precise leaf manipulation. The end-effector features a compact design that integrates an independent lighting source with a hyperspectral sensor, enabling high-fidelity data acquisition while streamlining the calibration process for accurate measurements. The robot is engineered to operate in unstructured orchard environments, and the system's performance was evaluated on both indoor and outdoor plant models. The system demonstrated reliable performance for 1-LPB hyperspectral sampling, achieving a 95% success rate in lab trials and 79% in field trials. Field experiments revealed an overall success rate of 70% for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard. The open-source repository is available at: https://github.com/mehradmrt/UCM-AgBot-ROS2
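One step the pipeline implies is turning a segmented leaf point cloud into a 6D grasp pose. A hedged sketch using PCA (centroid plus principal axes) follows; RoMu4o's actual pose-proposal method is described in the paper, and the names here are illustrative.

```python
import numpy as np

def leaf_grasp_pose(points):
    """Derive (position, 3x3 rotation) for a leaf from its Nx3 point cloud:
    centroid plus PCA axes, with the least-variance axis as the leaf normal."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    rotation = vt.T                      # columns: major, minor, normal axes
    if np.linalg.det(rotation) < 0:      # keep a right-handed frame
        rotation[:, 2] *= -1
    return centroid, rotation

# Toy leaf: a flat, noisy elliptical patch lying roughly in the XY plane.
rng = np.random.default_rng(0)
pts = rng.normal(scale=[0.05, 0.02, 0.002], size=(500, 3))
position, orientation = leaf_grasp_pose(pts)
```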


4D Metric-Semantic Mapping for Persistent Orchard Monitoring: Method and Dataset

arXiv.org Artificial Intelligence

Automated, persistent, and fine-grained monitoring of orchards at the individual tree or fruit level helps maximize crop yield and optimize resources such as water, fertilizers, and pesticides while preventing agricultural waste. Towards this goal, we present a 4D spatio-temporal metric-semantic mapping method that fuses data from multiple sensors, including LiDAR, an RGB camera, and an IMU, to monitor the fruits in an orchard across their growth season. A LiDAR-RGB fusion module is designed for 3D fruit tracking and localization, which first segments fruits using a deep neural network and then tracks them using the Hungarian assignment algorithm. Additionally, the 4D data association module aligns data from different growth stages into a common reference frame and tracks fruits spatio-temporally, providing information such as fruit counts, sizes, and positions. We demonstrate our method's accuracy in 4D metric-semantic mapping using data collected from a real orchard under natural, uncontrolled conditions with seasonal variations. We achieve a 3.1% error in total fruit count estimation for over 1790 fruits across 60 apple trees, along with accurate size estimation with a mean error of 1.1 cm. The datasets, consisting of LiDAR, RGB, and IMU data for five fruit species captured across their growth seasons, along with corresponding ground-truth data, will be made publicly available at: https://4d-metric-semantic-mapping.org/
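Since the tracking step relies on the Hungarian algorithm, a minimal sketch with SciPy is shown below, matching detected fruit centroids across frames by Euclidean distance. The gating threshold and function names are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_fruits(prev_centroids, curr_centroids, max_dist=0.05):
    """Match fruit centroids across frames with the Hungarian algorithm,
    gating out pairs farther apart than `max_dist` (meters, illustrative)."""
    cost = np.linalg.norm(
        prev_centroids[:, None, :] - curr_centroids[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    keep = cost[rows, cols] < max_dist
    return list(zip(rows[keep], cols[keep]))

prev = np.array([[0.00, 0.00, 1.00], [0.50, 0.10, 1.20]])
curr = np.array([[0.51, 0.12, 1.19], [0.01, 0.00, 1.01]])
print(match_fruits(prev, curr))   # fruit 0 -> detection 1, fruit 1 -> 0
```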


MARLP: Time-series Forecasting Control for Agricultural Managed Aquifer Recharge

arXiv.org Artificial Intelligence

The rapid decline in groundwater around the world poses a significant challenge to sustainable agriculture. To address this issue, agricultural managed aquifer recharge (Ag-MAR) has been proposed to recharge aquifers by artificially flooding agricultural lands with surface water. Ag-MAR requires a carefully selected flooding schedule to avoid impairing the oxygen absorption of crop roots. However, current Ag-MAR scheduling does not take into account complex environmental factors such as weather and soil oxygen, resulting in crop damage and insufficient recharge. This paper proposes MARLP, the first end-to-end data-driven control system for Ag-MAR. We first formulate Ag-MAR as an optimization problem. To that end, we analyze four years of in-field data, which reveal the multi-periodic nature of soil oxygen level trends and the opportunity to use external weather forecasts and flooding proposals as exogenous clues for soil oxygen prediction. We then design a two-stage forecasting framework. In the first stage, it extracts both cross-variate dependencies and periodic patterns from historical data to produce a preliminary forecast. In the second stage, it uses weather-soil and flooding-soil causality to facilitate an accurate prediction of soil oxygen levels. Finally, we conduct model predictive control (MPC) for Ag-MAR flooding. To address the challenge of large action spaces, we devise a heuristic planning module that reduces the number of flooding proposals, enabling the search for optimal solutions. Real-world experiments show that MARLP reduces the oxygen deficit ratio by 86.8% while improving the recharge amount per unit time by 35.8%, compared with the previous four years.
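A hedged sketch of the MPC loop described above: enumerate a heuristically pruned set of flooding schedules, forecast soil oxygen for each, and keep the schedule with the highest recharge that respects an oxygen floor. The toy forecaster and pruning rule are placeholders, not MARLP's trained models.

```python
import itertools

def mpc_select(forecast, horizon=4, o2_floor=0.1, max_flood_days=2):
    """Pick the binary flood/no-flood schedule that maximizes recharge
    (proxied by flooded days) while keeping predicted oxygen above a floor."""
    # Heuristic pruning: only consider schedules with few flooding days.
    proposals = [s for s in itertools.product([0, 1], repeat=horizon)
                 if sum(s) <= max_flood_days]
    best, best_recharge = None, -1.0
    for schedule in proposals:
        o2_path = forecast(schedule)         # predicted soil oxygen levels
        if min(o2_path) < o2_floor:          # oxygen constraint violated
            continue
        if sum(schedule) > best_recharge:
            best, best_recharge = schedule, sum(schedule)
    return best

# Placeholder forecaster: flooding depresses oxygen, rest days recover it.
def toy_forecast(schedule, o2=0.3):
    path = []
    for flood in schedule:
        o2 = max(0.0, o2 - 0.12 * flood) + 0.03 * (1 - flood)
        path.append(o2)
    return path

print(mpc_select(toy_forecast))
```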


Cost-Effective Activity Control of Asymptomatic Carriers in Layered Temporal Social Networks

arXiv.org Artificial Intelligence

The robustness of human social networks against epidemic propagation relies on the propensity for physical contact adaptation. During the early phase of infection, asymptomatic carriers exhibit the same activity level as susceptible individuals, which presents challenges for incorporating control measures in epidemic projection models. This paper focuses on modeling and cost-efficient activity control of susceptible and carrier individuals in the context of the susceptible-carrier-infected-removed (SCIR) epidemic model over a two-layer contact network. In this model, individuals switch from a static contact layer to create new links in a temporal layer based on state-dependent activation rates. We derive conditions for the infection to die out or persist in a homogeneous network. Considering the significant costs associated with reducing the activity of susceptible and carrier individuals, we formulate an optimization problem to minimize the disease decay rate while constrained by a limited budget. We propose the use of successive geometric programming (SGP) approximation for this optimization task. Through simulation experiments on Poisson random graphs, we assess the impact of different parameters on disease prevalence. The results demonstrate that our SGP framework achieves a cost reduction of nearly 33% compared to conventional methods based on degree and closeness centrality.
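For intuition about the compartmental dynamics, here is a mean-field SCIR sketch under assumed transition rates (susceptible to carrier to infected to removed). It deliberately ignores the paper's two-layer temporal network and activity control, and all rates are illustrative.

```python
import numpy as np

def scir(beta=0.4, sigma=0.2, gamma=0.1, days=200, dt=0.1):
    """Euler-integrate mean-field SCIR fractions; carriers transmit like
    infected individuals (all rates are illustrative assumptions)."""
    s, c, i, r = 0.99, 0.01, 0.0, 0.0
    history = []
    for _ in range(int(days / dt)):
        new_inf = beta * s * (c + i)       # susceptible -> carrier
        s -= dt * new_inf
        c += dt * (new_inf - sigma * c)    # carrier -> infected (symptomatic)
        i += dt * (sigma * c - gamma * i)  # infected -> removed
        r += dt * gamma * i
        history.append((s, c, i, r))
    return np.array(history)

traj = scir()
print("peak infected fraction: %.3f" % traj[:, 2].max())
```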